Assessing the transportability of clinical prediction models for cognitive impairment using causal models

Background: Machine learning models promise to support diagnostic predictions, but may not perform well in new settings. Selecting the best model for a new setting is challenging when no data from that setting are available. We aimed to investigate the transportability, in terms of calibration and discrimination, of prediction models for cognitive impairment in simulated external settings with different distributions of demographic and clinical characteristics.

Methods: We mapped and quantified relationships between variables associated with cognitive impairment using causal graphs, structural equation models, and data from the ADNI study. These estimates were then used to generate datasets and evaluate prediction models with different sets of predictors. We measured transportability to external settings under guided interventions on age, APOE ε4, and tau-protein, using performance differences between internal and external settings measured by calibration metrics and the area under the receiver operating characteristic curve (AUC).

Results: Calibration differences indicated that models predicting with causes of the outcome were more transportable than those predicting with consequences. AUC differences showed inconsistent trends of transportability across the different external settings. Models predicting with consequences tended to show higher AUC in the external settings than in the internal settings, while models predicting with parents or all variables showed similar AUC.

Conclusions: We demonstrated with a practical prediction task that predicting with causes of the outcome results in better transportability than anti-causal prediction when considering calibration differences. We conclude that calibration performance is crucial when assessing model transportability to external settings.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12874-023-02003-6.


Supplementary Material
Assessing the transportability of clinical prediction models for cognitive impairment using causal models
Jana Fehr, Marco Piccininni, Tobias Kurth, Stefan Konigorski, for the Alzheimer's Disease Neuroimaging Initiative

Supplementary Text 1: Details on data processing
We obtained age, sex, education and the presence of the APOE ε4 genotype (0 or 1).

[...] between brain amyloid-β and regional brain atrophy in 280 normal older people, 30% of whom had evidence of brain amyloid deposition although evidence for such associations were seen in those with MCI."

Supplementary Tables
Tau-deposits (proxy: CSF-Tau) → brain atrophy: "While elevated CSF tau concentrations have been shown to be associated with lower grey matter density (GMD) in AD † -specific regions, this correlation has yet to be examined for plasma in a large study. Plasma tau may serve as a non-specific marker for neurodegeneration but is still relevant to AD † considering low GMD was associated with plasma tau in Aβ+ participants and not Aβ- participants."

Table S6: Transportability of algorithms trained to predict cognitive impairment, measured by differences of Integrated Calibration Index (ICI), Brier calibration component, and area under the receiver operating characteristic curve (AUC) between internal validation and intervention test sets. Three types of algorithms (logistic regression, random forest (rf) and generalized boosted regression (GBM)) were trained to predict cognitive impairment. Trained algorithms were applied to predict cognitive impairment on validation datasets, which had the same distribution of the age, APOE ε4 and sex variables as the training data, and on intervention test sets, which had either a younger population mean age (age and age2), a lower APOE ε4 allele frequency (APOE ε4) or a different CSF-tau (tau) mechanism compared to the training and internal validation datasets. Training and test data were generated with n=10,000 observations, 10,000 times, based on the SEM parameters obtained from the first imputed dataset. Transportability was measured by the ICI, Brier calibration component, and AUC difference between internal validation and intervention sets. ICI, Brier, and AUC differences are given as median, 2.5% percentile and 97.5% percentile across 10,000 repetitions. Negative differences in ICI and Brier, and positive differences in AUC, indicate reduced transportability of the model in the intervention setting compared to the internal validation setting.

Figure S2: Overview of the process to generate datasets for training and validating prediction models.
Each dataset consists of the exogenous (independent) variables age, sex and APOE ε4 and 14 endogenous variables. The exogenous variables served as input to generate the endogenous variables. Endogenous variables were generated one by one, using the defined linear equations with the parameter estimates obtained from the structural equation model (SEM) and the previously generated exogenous and endogenous variables. The exogenous variables for the training and internal validation datasets were sampled by bootstrapping the respective variable in the original ADNI dataset 10,000 times, so that the distribution of the simulated training and internal validation data matches the ADNI data. For the age-intervention settings (age and age2), age values were randomly sampled from a normal distribution with a defined mean (age: 35, age2: 65) and standard deviation (age and age2: 10), while sex and APOE ε4 values were generated by bootstrapping from the ADNI dataset. For the APOE ε4 intervention setting, the binary APOE ε4 values were sampled from a Bernoulli distribution with probability 0.05, while age and sex values were bootstrapped from the ADNI data. For the tau intervention setting, all three exogenous variables were bootstrapped from the original data, but the parameter estimates of the linear equation with tau as the dependent variable were altered to generate the tau values and the consequences of tau.
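The sampling scheme described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `observed` dictionary standing in for the ADNI columns, the Gaussian noise term and any coefficient values passed to `linear_endogenous` are hypothetical. The sketch also assumes that each exogenous variable is resampled independently, which is one reading of "bootstrapping the respective variable".

```python
import random


def bootstrap(values, n, rng):
    """Resample n values with replacement from one observed variable."""
    return [rng.choice(values) for _ in range(n)]


def sample_exogenous(setting, observed, n, rng):
    """Draw age, sex and APOE e4 for one simulated setting.

    `observed` is a hypothetical dict of lists standing in for the ADNI data.
    Settings: "internal", "age", "age2", "apoe4", "tau".
    """
    if setting not in {"internal", "age", "age2", "apoe4", "tau"}:
        raise ValueError(f"unknown setting: {setting}")

    # Age interventions replace the bootstrap by a normal distribution.
    if setting == "age":
        age = [rng.gauss(35, 10) for _ in range(n)]
    elif setting == "age2":
        age = [rng.gauss(65, 10) for _ in range(n)]
    else:
        age = bootstrap(observed["age"], n, rng)

    # The APOE e4 intervention replaces the bootstrap by Bernoulli(0.05).
    if setting == "apoe4":
        apoe4 = [1 if rng.random() < 0.05 else 0 for _ in range(n)]
    else:
        apoe4 = bootstrap(observed["apoe4"], n, rng)

    sex = bootstrap(observed["sex"], n, rng)
    return {"age": age, "sex": sex, "apoe4": apoe4}


def linear_endogenous(parents, coef, intercept, noise_sd, rng):
    """Generate one endogenous variable as a linear function of its parents.

    parents: dict name -> list of values; coef: dict name -> slope.
    Gaussian noise is an assumption of this sketch, and any coefficients
    supplied by the caller are placeholders, not the SEM estimates.
    """
    n = len(next(iter(parents.values())))
    return [
        intercept
        + sum(coef[name] * parents[name][i] for name in coef)
        + rng.gauss(0, noise_sd)
        for i in range(n)
    ]
```

In the tau setting the exogenous draw is identical to the internal one; only the coefficients passed to `linear_endogenous` for the tau equation would differ, and every variable downstream of tau would then be regenerated from the altered tau values.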
Supplementary Figure S4: Calibration in internal and external validation settings from 10,000 repetitions. The table summarizes results generated using SEM-parameters from the first imputed dataset. The figure shows calibration measured by the decomposed Brier score calibration component in the different external validation settings (age intervention, age2 intervention, APOE ε4 intervention, different CSF-tau mechanism). Cognitive impairment was predicted using logistic regression, lasso regression, random forest (rf) and generalized boosted regression (gbm) prediction models. Models were trained either with all predictor variables, only parent nodes (direct causes) of the outcome, only children nodes (consequences) of the outcome, or with the exogenous variables age, sex, and APOE ε4 (apoe4). Points in the bottom left quadrant indicate models that had good calibration in both the internal validation and the external settings. Points in the top left quadrant indicate models that had good calibration in the internal validation but poor calibration in the external settings. Points in the top right quadrant indicate poor calibration in both the internal validation and the external settings.
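For readers unfamiliar with the Brier calibration component, the following sketch shows one common way to compute it: the reliability term of the Murphy decomposition, estimated over equal-width probability bins. The binning scheme and the number of bins are assumptions of this sketch; the caption does not state how the decomposition was implemented.

```python
def brier_calibration_component(probs, outcomes, n_bins=10):
    """Reliability term of the Brier score decomposition (lower is better).

    Predictions are grouped into equal-width bins; within each bin the squared
    gap between mean predicted probability and observed event rate is weighted
    by the share of observations falling into that bin.
    """
    bins = {}
    for p, y in zip(probs, outcomes):
        k = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes into the last bin
        stats = bins.setdefault(k, [0.0, 0.0, 0])
        stats[0] += p  # sum of predicted probabilities
        stats[1] += y  # number of observed events
        stats[2] += 1  # bin size
    n = len(probs)
    return sum(
        count * (sum_p / count - sum_y / count) ** 2
        for sum_p, sum_y, count in bins.values()
    ) / n
```

A value near zero means that predicted probabilities match observed event rates within each bin, which corresponds to the "good calibration" regions of the figure.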

Supplementary Figure S5: Transportability between internal validation and external validation settings, measured by the difference in the Brier calibration component. The ADNI data used to determine the parameters for generating the training and validation data were imputed three times. Figure a) shows the results from 10,000 repetitions on the first imputed dataset, and Figures b) and c) show the results from 10,000 repetitions on the other two imputed datasets.
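The summary reported for these differences (median with 2.5% and 97.5% percentiles across repetitions, as stated in the caption of Table S6) can be sketched as below. The direction of the difference, internal minus external, is inferred from the sign convention in that caption and is an assumption here, as are the function names and the linear-interpolation percentile rule.

```python
import statistics


def percentile(sorted_vals, q):
    """Percentile by linear interpolation, for q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def summarise_differences(internal, external):
    """Median and 2.5%/97.5% percentiles of per-repetition differences.

    internal, external: one metric value per repetition, in the same order.
    """
    diffs = sorted(i - e for i, e in zip(internal, external))
    return (
        statistics.median(diffs),
        percentile(diffs, 2.5),
        percentile(diffs, 97.5),
    )
```

With this convention, a calibration metric that grows in the external setting yields a negative difference, matching the reading that negative ICI and Brier differences indicate reduced transportability.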

Supplementary Figure S6: Area under the receiver operating characteristic curve (AUC) in internal and external validation settings from 10,000 repetitions on one imputed dataset. Cognitive impairment was predicted using logistic regression, lasso regression, random forest (rf) and generalized boosted regression (gbm) prediction models. Models were trained either with all predictor variables, only parent nodes (direct causes) of the outcome, only children nodes (consequences) of the outcome, or with the exogenous variables age, sex, and APOE ε4 (apoe4). Points in the bottom left quadrant indicate models that had poor discrimination performance in both the internal validation and the external settings. Points in the top right quadrant indicate good discrimination performance in both the internal validation and the external settings.
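As a reminder of what the AUC measures, the sketch below computes it as the probability that a randomly chosen impaired case receives a higher predicted score than a randomly chosen unimpaired case, counting ties as one half. This pairwise form is a standard equivalent of the area under the ROC curve; it is written for clarity rather than speed and is not taken from the authors' code.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    scores: predicted probabilities or scores; labels: 1 for cases, 0 otherwise.
    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return correct / (len(pos) * len(neg))
```

Because the AUC depends only on the ranking of scores, it can stay stable or even rise in an external setting where calibration deteriorates, which is why the two measures are reported separately.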

Supplementary Figure S7: Transportability between internal validation and external validation settings, measured by the difference in the area under the receiver operating characteristic curve (AUC). The ADNI data used to determine the parameters for generating the training and validation data were imputed three times. Figure a) shows the results from 10,000 repetitions on the first imputed dataset, and Figure b) shows the results from 10,000 repetitions on another imputed dataset.

Supplementary Figure S8: Model performance in the internal validation setting, measured by the integrated calibration index (ICI), Brier calibration component, and area under the receiver operating characteristic curve (AUC). The ADNI data used to determine the parameters for generating the training and validation data were imputed three times. Figure a) shows the results from 10,000 repetitions on the first imputed dataset, and Figure b) shows the results from 10,000 repetitions on another imputed dataset.

Supplementary Figure S9: Transportability between internal validation and external validation settings, measured by the difference in the integrated calibration index (ICI). The ADNI data used to determine the parameters for generating the training and validation data were imputed three times. Figure a) shows the results from 10,000 repetitions on the first imputed dataset, and Figure b) shows the results from 10,000 repetitions on another imputed dataset.
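
The ICI is the average absolute gap between each predicted probability and a smoothed estimate of the observed event rate at that probability. The sketch below illustrates the idea only: the ICI is usually computed with a loess-smoothed calibration curve, whereas this example swaps in a simple Gaussian kernel smoother to stay dependency-free. The bandwidth is an arbitrary placeholder, so values will not match those of a loess-based implementation.

```python
import math


def ici(probs, outcomes, bandwidth=0.1):
    """Integrated calibration index with a Gaussian kernel smoother.

    For every prediction p, the observed event rate near p is estimated as a
    kernel-weighted mean of the outcomes; the ICI is the mean absolute
    difference between p and that estimate. Runs in O(n^2) time.
    """
    total = 0.0
    for p in probs:
        weights = [
            math.exp(-0.5 * ((p - q) / bandwidth) ** 2) for q in probs
        ]
        smoothed = sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)
        total += abs(p - smoothed)
    return total / len(probs)
```

An ICI of zero corresponds to a smoothed calibration curve lying on the diagonal; larger values indicate larger average miscalibration.
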
Supplementary Figure S10: Transportability of random forest (rf) models with optimized hyperparameters between internal validation and intervention test sets, measured by the integrated calibration index (ICI), Brier calibration component and area under the receiver operating characteristic curve (AUC). Figure a) shows the performance in the internal validation setting. Figure b) shows the transportability, measured by differences between internal and external validation settings. The data for training and validating the random forest models were generated 100 times, and hyperparameters were optimized for each training set to minimize the deviance. The optimized hyperparameters for the rf were the number of predictors sampled for splitting at each node (mtry), from 1 to 5, and the minimum size of terminal nodes (nodesize), with values 1, 5 or 10.
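
The search space described in this caption contains 5 × 3 = 15 combinations. A minimal sketch of an exhaustive search over that grid follows. The `deviance_fn` callback is hypothetical: the caption does not state on which data the deviance was evaluated (for example out-of-bag or cross-validated predictions), so the fitting and scoring step is left to the caller.

```python
import math
from itertools import product

# Grid taken from the caption: mtry in 1..5, nodesize in {1, 5, 10}.
GRID = list(product(range(1, 6), (1, 5, 10)))


def binomial_deviance(probs, outcomes, eps=1e-12):
    """Binomial deviance, -2 times the Bernoulli log-likelihood."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -2.0 * total


def best_hyperparameters(deviance_fn):
    """Return the (mtry, nodesize) pair with the smallest deviance.

    deviance_fn(mtry, nodesize) is a hypothetical callback that fits a random
    forest with the given settings and returns its deviance.
    """
    return min(GRID, key=lambda pair: deviance_fn(*pair))
```

In use, `deviance_fn` would wrap the model fit on one generated training set and score it with `binomial_deviance`, and the search would be repeated for each of the 100 generated datasets.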